ABSTRACT

The location of an acoustical source can be found robustly using
the Steered Response Power - Phase Transform (SRP-PHAT) algorithm.
However, SRP-PHAT can be computationally expensive, requiring a
search over a large number of candidate locations. The required
spacing between these locations depends on the sampling rate, the
microphone array geometry, and the source location. This work
presents a novel method that calculates a smaller number of test
points using an efficient closed-form localization algorithm. The
method significantly reduces the number of calculations while
remaining robust in acoustical environments.

1.  INTRODUCTION 

Beamforming is often used for removing noise and reverberation
from speech signals by taking advantage of spatial information.
The array response is steered to concentrate on the signal at a
given location and to attenuate noise and interference from other
directions. The location is usually not known and must be estimated
from the data. Beamformers do not perform well in the presence of
steering errors, so they require accurate location estimates [1]. In
addition to beamforming, the location can be used in a joint
camera-microphone teleconferencing system [2] or for speaker
segmentation [3]. Source localization is therefore an integral part
of microphone array processing.

Several methods have been developed for estimating an acoustical
source location. Algorithms like SRP-PHAT are robust in the
presence of room effects [4]. However, SRP-PHAT can be quite
complex, requiring the evaluation of a large number of test points
in the region of possible source locations. The location is chosen
to be the point that results in the highest energy or likelihood.
The proper distance between points is determined by the mapping of
the Nyquist rate from the time domain to space. As such, the number
of candidate locations depends on the sampling frequency as well as
the aperture size and the range of the source. The number of points
can be reduced if the source is constrained to a plane or to the
far field.

Alternatively, the problem can be implemented as a two-step
process. First, the generalized cross-correlation (GCC) is used to
find the time delays. Then those time delays are used to estimate a
three-dimensional location [2,5]. Errors, sometimes called
anomalies, occur frequently in the time delay estimates [6]. These
anomalies are caused by strong reflections of the sound source,
which are sometimes greater in energy than the direct signal. The
direct path can be obstructed or attenuated because of source and
microphone directivity. Anomalous time delay estimates create
large errors in the location estimate. So while these algorithms
are quite fast, they lack robustness.

Instead of blindly testing many candidate locations, a novel
algorithm, called Hybrid Localization, is presented. This algorithm
is well suited to locating a source in the near field with a
large-aperture microphone array. Using multiple time delay
estimates from each microphone pair, it applies a two-step
procedure to generate a set of candidate locations. These locations
become the candidate locations for SRP-PHAT. Although the
calculation of these candidate points requires some computation,
the total computational cost is lower than that of SRP-PHAT. This
is accomplished without a decrease in the robustness of the
location estimates.

2.  MODEL 

In  the  following  discussion,  the  received  sound  signals  will  be 
modeled  as 

$$x_i(n) = \sum_j h_{ij} * s_j(n) + \psi_i(n) \qquad (1)$$

where $s_j$ is the signal from the jth sound source and $h_{ij}$ is
the impulse response of the filter between the jth sound source and
the ith microphone. The number of sources is represented by
$N_{src}$ and M is the number of microphones. The noise for each
channel is represented as $\psi_i(n)$ and it is independent of the
noise in other channels. It is also assumed that all sources are
independent of each other and of the noise.

GCC  is  computed  in  the  frequency  domain,  by  converting  a 
frame  of  time  data  using  an  FFT. 

$$X_i(\omega) = \sum_j H_{ij}(\omega) S_j(\omega) + \Psi_i(\omega) \qquad (2)$$

The  Phase  Transform  (PHAT)  is  a  GCC  defined  as 

$$R_{ik}(\omega) = \frac{X_i(\omega) X_k^*(\omega)}{|X_i(\omega) X_k^*(\omega)|} \qquad (3)$$

This equation is then transformed back to the time domain. The
results are used as the energy function for SRP-PHAT and to
estimate time delays. In order for (2) to be valid, the frame must
be longer than the length of the impulse response.
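
As an illustration of the PHAT computation described above, the following is a minimal sketch in Python with NumPy. It is not the implementation used in this paper: the function name `gcc_phat`, the guard `eps`, and the use of a circular (unpadded) correlation are assumptions made for brevity.

```python
import numpy as np

def gcc_phat(x_i, x_k, eps=1e-12):
    """PHAT-weighted cross-correlation of two equal-length frames.

    The output is in FFT order: index 0 is lag 0 and negative lags
    wrap around to the end of the array (circular correlation).
    """
    n = len(x_i)
    X_i = np.fft.fft(x_i, n)
    X_k = np.fft.fft(x_k, n)
    cross = X_i * np.conj(X_k)
    # Keep only the phase of the cross-spectrum (PHAT weighting).
    R = cross / np.maximum(np.abs(cross), eps)
    return np.real(np.fft.ifft(R))

# Synthetic check: channel i is channel k circularly delayed by 5 samples.
rng = np.random.default_rng(0)
s = rng.standard_normal(256)
r = gcc_phat(np.roll(s, 5), s)
lag = int(np.argmax(r))
energy = float(np.sum(r ** 2))
```

In this noise-free synthetic case the correlation collapses to a single peak at the true delay, and the total energy over all lags is one, which is the property used later to estimate the noise floor.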


3.  HYBRID  LOCALIZATION

Fig. 1. An example frame using the PHAT method. The GCC
consists of multiple delays embedded in noise. Since the direct
path is less energetic than one of the reflections, this frame results
in an anomaly.


The impulse response can be characterized as having three distinct
parts. First is the direct path. This is followed by several
discrete reflections and then by diffuse reverberation. The length
of the impulse response is characterized by the reverberation time,
$T_{60}$, which is the time it takes for the signal energy to decay
by 60 dB. Typical rooms have a reverberation time of 300 to 700 ms
[7], which results in a very long impulse response.

However,  the  impulse  response  can  be  approximated  as 

$$h_{ij} \approx \alpha_{ij}^{(0)} \delta(n - \tau_{ij}^{(0)}) + \sum_{l=1}^{N_r} \alpha_{ij}^{(l)} \delta(n - \tau_{ij}^{(l)}) \qquad (4)$$

Delay elements, $\delta(n - \tau_{ij}^{(l)})$, represent the direct
path and the $N_r$ strongest early reflections. The direct path is
designated as l = 0. The attenuation of the sound energy is
represented as $\alpha_{ij}^{(l)}$. The reverberation is included
with the noise, $\psi_i(n)$. So the noise includes reverberations
of the sources and diffuse noise. Shorter frame sizes can now be
used, since the reverberation is no longer considered part of the
impulse response.

The  resulting  GCC  will  consist  of  several  peaks  embedded  in 
a  noise  floor.  The  peaks  correspond  to  the  delays  in  the  direct 
path  and  the  early  reflections.  The  noise  floor  is  caused  by  the 
reverberation  and  noise.  This  results  in  the  following  model  for 
the  cross-correlation. 

$$r_{ik}(n) = \sum_l \tilde{\alpha}_{ik}^{(l)} \delta(n - \tilde{\tau}_{ik}^{(l)}) + \eta_{ik}(n) \qquad (5)$$

where $\eta_{ik}(n)$ is the noise floor with a variance of
$\sigma_{ik}^2$ and $\delta(n - \tilde{\tau}_{ik}^{(l)})$ represents
the peaks, corresponding to the delay elements in (4). An example
frame can be seen in Fig. 1. If each channel has $N_r$ reflections,
the resulting cross-correlation could have as many as
$(N_r + 1)^2$ peaks. In practice, the number of significant
reflections is usually fairly low.


The time difference of arrival of the direct path is non-linearly
related to location.

$$\tau_{ij} \propto \|r_s - m_i\| - \|r_s - m_j\| \qquad (6)$$

where $r_s$ is the source location and $m_i$ is the location of the
ith microphone. Spherical intersection (SX) [8] and spherical
interpolation (SI) [2] use the time delays corresponding to the
maxima in the GCC to calculate the location. Starting with (6), a
set of linearized equations can be created and collected into a
matrix equation. For example, SX is stated as

$$\hat{r}_s = (A^T A)^{-1} A^T (b - d R_s) \qquad (7)$$

The zeroth microphone is located at the origin. The matrix A is
composed of the remaining microphone locations and $A^T$ is its
transpose. $R_s$ is the distance of the source to the origin and is
equal to the norm of $r_s$. The vector d is composed of delay
elements $d_{i0} = (c/F_s)\tau_{i0}$, where c is the speed of sound
and $F_s$ is the sampling rate. The ith row of vector b is

$$b_i = \frac{1}{2}(\|m_i\|^2 - d_{i0}^2) \qquad (8)$$

Since $R_s$ is unknown, it must be estimated. This is accomplished
by squaring (7), substituting $\|r_s\|^2$ with $R_s^2$, and solving
the resulting quadratic equation. The location is then estimated
from (7) using the resulting $R_s$.
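
The closed-form SX step can be sketched as follows. This is an illustrative Python/NumPy sketch rather than the code used for the results in this paper. The function name `spherical_intersection` is hypothetical, and the rule for choosing between the two roots of the quadratic (keep the positive real root whose predicted range differences best match the measured ones) is an assumption, since the root selection is not specified above.

```python
import numpy as np

def spherical_intersection(mics, d):
    """Closed-form SX estimate with the reference microphone at the origin.

    mics: (N, 3) positions of the non-reference microphones (matrix A).
    d:    (N,) range differences d_i0 = |r_s - m_i| - |r_s| in metres.
    """
    A = np.asarray(mics, dtype=float)
    d = np.asarray(d, dtype=float)
    b = 0.5 * (np.sum(A ** 2, axis=1) - d ** 2)   # eq. (8)
    P = np.linalg.pinv(A)                         # (A^T A)^-1 A^T
    u, v = P @ b, P @ d                           # eq. (7): r_s = u - v R_s
    # |u - v R|^2 = R^2  ->  (v.v - 1) R^2 - 2 (u.v) R + u.u = 0
    roots = np.roots([v @ v - 1.0, -2.0 * (u @ v), u @ u])
    best, best_err = None, np.inf
    for R in roots:
        if abs(R.imag) > 1e-6 or R.real <= 0.0:
            continue                              # R_s must be real and positive
        r = u - v * R.real
        predicted = np.linalg.norm(r - A, axis=1) - np.linalg.norm(r)
        err = float(np.sum((predicted - d) ** 2))
        if err < best_err:                        # assumed root-selection rule
            best, best_err = r, err
    return best

# Noise-free check with five microphones plus the reference at the origin.
mics = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1]], float)
source = np.array([0.4, 0.3, 0.6])
d_true = np.linalg.norm(source - mics, axis=1) - np.linalg.norm(source)
estimate = spherical_intersection(mics, d_true)
```

With exact range differences the estimate coincides with the true source; with anomalous delays the same formula returns a confident but wrong location, which is the weakness discussed next.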

Unfortunately, the time delay estimates of individual
cross-correlations are quite noisy, introducing significant
localization errors. The underlying assumption of SX and SI is that
the dominant peak is at the correct time delay for the source.
However, the estimated time delay frequently corresponds to a
strong reflection or to an interfering source near the microphone
pair. Reflections can be stronger than the direct path because of
an obstruction of the direct path or the directivity of the sound
source. For example, a human speaker is not an omni-directional
sound source, and the sound propagating in front of a person is
more energetic than the sound propagating behind [9].

Although SX is not robust, it is quite fast, so it can be used to
generate a set of candidate locations quickly. Several maxima from
each channel pair are used to find several possible time delays.
The number of time delays per pair is denoted by $N_p$. The
individual time delay estimates can be combined in $N_p^M$
different combinations, where M here represents the number of
microphone pairs. This is an unrealistically large number of
combinations. So in practice, it is best to use a subset of the
channel pairs to estimate an initial set of locations. When using
SX, only three pairs are needed to create a set of points, which
results in $N_p^3$ candidate locations.
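
The size of the initial candidate set can be illustrated with a short sketch. The helper name `delay_combinations` and the delay values are hypothetical; the point is only that three pairs with Np delays each give Np cubed combinations, each of which would be passed to SX.

```python
from itertools import product

def delay_combinations(delays_per_pair):
    """Enumerate every way of picking one candidate delay per selected pair."""
    return list(product(*delays_per_pair))

# Hypothetical delays (in samples) for three selected pairs, with N_p = 2.
selected = [[3, -7], [12, 1], [-4, 9]]
combos = delay_combinations(selected)
n_initial = len(combos)        # N_p ** 3 combinations for three pairs
n_exhaustive = 2 ** 28         # N_p ** M if all M = 28 pairs were combined
```

For eight microphones the exhaustive count exceeds 268 million combinations even with only two delays per pair, which is why a three-pair subset is used for the initial set.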

The next step matches the time delays from the other microphone
pairs to the initial locations.

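$$\hat{d}_{i0} = \arg\min_k \| m_i^T r_c - (b_i^{(k)} - R_c d_{i0}^{(k)}) \|^2 \qquad (9)$$

for all i not in the original set of microphone pairs, where $r_c$
is a candidate location from the initial set, $R_c$ is its distance
from the origin, and k indexes the candidate time delays of pair i.
The time delays for the remaining microphone pairs are chosen to
minimize the error, which is derived from the SX equations.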
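
A minimal sketch of this matching step, under the assumption that the error is the residual of the SX row equation for microphone i evaluated at the candidate location, might look as follows. The function name `match_delay` and the numerical values are illustrative only.

```python
import numpy as np

def match_delay(m_i, r_c, d_candidates):
    """Pick the candidate range difference of pair (i, 0) that best fits r_c.

    m_i:          position of microphone i (reference microphone at origin).
    r_c:          candidate source location from the initial SX step.
    d_candidates: candidate range differences d_i0^(k) in metres.
    """
    m_i = np.asarray(m_i, dtype=float)
    r_c = np.asarray(r_c, dtype=float)
    d = np.asarray(d_candidates, dtype=float)
    R_c = np.linalg.norm(r_c)
    b = 0.5 * (m_i @ m_i - d ** 2)                # eq. (8) for each candidate
    err = (m_i @ r_c - (b - R_c * d)) ** 2        # residual of the SX equation
    k = int(np.argmin(err))
    return k, float(d[k])

# Candidate location equal to the true source; the middle delay is the true one.
m = np.array([0.0, 1.0, 1.0])
r_c = np.array([0.4, 0.3, 0.6])
d_direct = np.linalg.norm(r_c - m) - np.linalg.norm(r_c)
k_best, d_best = match_delay(m, r_c, [0.5, d_direct, -0.3])
```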

After the best time delay estimates are found for each candidate
location, the locations are re-estimated using all the channel
pairs for SX. These locations are used as the test locations, q, in
SRP-PHAT.

Fig. 2. A block diagram of the Hybrid Algorithm

$$\hat{r}_s = \arg\max_q \sum_{i=1}^{M} \sum_{j=1}^{M} r_{ij}(\tau_{ij}) \qquad (10)$$

where $\tau_{ij}$ is the time delay corresponding to the test
location q and $r_{ij}$ is the PHAT GCC. Traditionally, SRP-PHAT
must sample a large set of points. Many of these points are very
unlikely to be the source location. However, the candidate
locations generated by Hybrid Localization are all very likely to
be the location of the source. As long as $N_p$ is small, there are
relatively few candidate locations.
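
Evaluating the SRP-PHAT energy only at a short list of candidate locations can be sketched as below. This is an illustrative sketch, not the paper's implementation: the names `tdoa_samples` and `srp_phat_select`, the speed of sound of 343 m/s, rounding to the nearest sample, and summing over unordered pairs instead of the full double sum are all assumptions. The sign convention of the delay must match the one used when the GCC was computed.

```python
import numpy as np
from itertools import combinations

C_SOUND = 343.0  # speed of sound in m/s (assumed value)

def tdoa_samples(q, m_i, m_j, fs, c=C_SOUND):
    """Delay in samples between microphones i and j for a source at q."""
    return fs / c * (np.linalg.norm(q - m_i) - np.linalg.norm(q - m_j))

def srp_phat_select(candidates, mics, gcc, fs):
    """Return the candidate with the largest summed PHAT correlation.

    gcc[(i, j)] is an FFT-ordered correlation (negative lags wrap around).
    """
    scores = []
    for q in candidates:
        total = 0.0
        for (i, j), r in gcc.items():
            lag = int(round(tdoa_samples(q, mics[i], mics[j], fs)))
            total += r[lag % len(r)]
        scores.append(total)
    return candidates[int(np.argmax(scores))], scores

# Synthetic check: each pair's correlation is a unit peak at the true delay.
fs = 16000.0
mics = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
source = np.array([0.4, 0.3, 0.6])
gcc = {}
for i, j in combinations(range(len(mics)), 2):
    r = np.zeros(1024)
    r[int(round(tdoa_samples(source, mics[i], mics[j], fs))) % 1024] = 1.0
    gcc[(i, j)] = r
candidates = [np.array([0.9, 0.9, 0.1]), source, np.array([0.1, 0.8, 0.8])]
best, scores = srp_phat_select(candidates, mics, gcc, fs)
```

The cost of this step grows with the number of candidates times the number of pairs, so keeping the candidate list short is what makes the hybrid approach cheap.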

Frequently,  it  is  known  that  a  source  is  located  in  a  certain 
region.  So  all  candidates  outside  of  this  region  can  be  pruned.  In 
addition,  past  location  estimates  can  be  used  to  predict  the  current 
location.  This  is  called  tracking  and  the  results  of  tracking  are 
included  in  the  set  of  candidates.  These  modifications  increase 
localization  accuracy. 

4.  A  METRIC  FOR  BEST  CHANNEL  PAIRS 

If all channel pairs were equally good, then it would not matter
which pairs were used. Unfortunately, this is not the case. So a
metric is needed in order to choose which channel pairs best
estimate the set of initial locations. The metric used in this
paper can be developed using (5). Due to the PHAT weighting, the
energy of the cross-correlation over all time delays is unity, and
$\sigma_{ik}^2$ can be estimated by subtracting the energy of the
peaks from the total energy.

$$\hat{\sigma}_{ik}^2 = \sum_n r_{ik}^2(n) - \sum_l (\tilde{\alpha}_{ik}^{(l)})^2 = 1 - \sum_l (\tilde{\alpha}_{ik}^{(l)})^2 \qquad (11)$$

Intuitively,  the  best  channel  pairs  to  use  are  those  with  the  lowest 
energy  noise  floor  or  alternatively  those  pairs  with  the  highest  peak 
energy. 
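
The pair-selection metric can be sketched as follows. The names `noise_floor` and `best_pairs` are hypothetical, and treating the largest-magnitude samples as the peaks is a simplification (a real peak may spread over adjacent samples).

```python
import numpy as np

def noise_floor(r, n_peaks):
    """Estimate the noise-floor energy: total energy minus peak energy."""
    r = np.asarray(r, dtype=float)
    peaks = np.sort(np.abs(r))[-n_peaks:]         # n_peaks largest magnitudes
    return float(np.sum(r ** 2) - np.sum(peaks ** 2))

def best_pairs(gcc, n_peaks, n_select=3):
    """Rank channel pairs by increasing noise floor and keep the best ones."""
    ranked = sorted(gcc, key=lambda pair: noise_floor(gcc[pair], n_peaks))
    return ranked[:n_select]

# Two toy unit-energy correlations: one peaky, one with its energy spread out.
clean = np.zeros(64)
clean[[5, 20]] = [0.8, 0.6]
noisy = np.zeros(64)
noisy[[5, 20, 33, 47]] = 0.5
floor_clean = noise_floor(clean, 2)
floor_noisy = noise_floor(noisy, 2)
chosen = best_pairs({(0, 1): noisy, (0, 2): clean}, n_peaks=2, n_select=1)
```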

It turns out that (11) is related to the early energy to total
energy ratio of the impulse response.

$$\frac{\int_0^{50\,\mathrm{ms}} [h(t)]^2 \, dt}{\int_0^{\infty} [h(t)]^2 \, dt} \qquad (12)$$


This measure is often used to determine the intelligibility of
sound when designing acoustic spaces [7]. Intuitively, the best
microphone pairs are those with the highest early energy to late
energy ratio.

To recap, a block diagram of Hybrid Localization can be seen
in Fig. 2. First, PHAT is used to find $N_p$ time-delay estimates
for each microphone pair. The metric (11) determines which three
microphone pairs SX uses to estimate the set of initial candidate
locations. Using the remaining microphone pairs, the time delays
that correspond to the candidate locations are determined using
(9). Finally, the candidate locations are re-estimated and
SRP-PHAT is used to test these locations for the one with maximum
energy.

5.  SIMULATIONS 

The hybrid algorithm was tested in both simulated and real
scenarios. The resulting estimates were compared with those
obtained using SRP-PHAT [4] and SI [2]. The SRP-PHAT candidate
locations are chosen in a non-linear optimal fashion based on the
Nyquist rate. This method requires fewer candidate locations than
a linear spacing. Hybrid source localization was performed using
increasing values of $N_p$ from 2 to 6. The resulting candidate
locations were pruned to match the region of interest used in
SRP-PHAT, and the previous location estimate was added to the
candidates. The error is defined as

$$E = \frac{1}{L} \sum \|\hat{r}_s - r_s\|^2 \qquad (13)$$

where  L  is  the  number  of  frames.  While  the  hybrid  algorithm 
could  be  used  to  find  multiple  sources,  only  the  single  source  case 
was  tested  in  this  paper. 
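
For concreteness, the error measure can be computed as in the following sketch, where the sum runs over the L frames. The function name `location_mse` and the example values are illustrative, not data from the experiments.

```python
import numpy as np

def location_mse(estimates, truths):
    """Mean over frames of the squared distance between estimate and truth."""
    e = np.asarray(estimates, dtype=float) - np.asarray(truths, dtype=float)
    return float(np.mean(np.sum(e ** 2, axis=1)))

# Two frames: the first estimate is off by 1 m, the second is exact.
mse = location_mse([[0, 0, 0], [1, 1, 1]], [[0, 0, 1], [1, 1, 1]])
```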

The simulated room had dimensions of 4 m by 5 m by 3 m.
The omni-directional microphones were placed in the middle of each
wall at an elevation of 1 m and at 0.1 m below the ceiling, at a
distance of 0.1 m from the wall. The region of interest was defined
as a box with dimensions of 2 m by 2.5 m by 0.4 m. In order to
adequately test the space, SRP-PHAT requires about 20,000 points.

The  source  locations  were  placed  in  random  locations  inside 
the  region  of  interest.  The  sources  consist  of  human  speech  by 
both  males  and  females  in  English  and  French.  Human  speech 
is  not  omni-directional,  so  the  directivity  data  from  [9]  was  used. 
The  impulse  response  was  created  using  the  image  method.  The 
ceiling  had  an  absorption  coefficient  of  0.3  and  the  floor  had  a 
coefficient  of  0.7.  The  walls  had  an  absorption  coefficient  ranging 
from  0.05  to  0.3,  which  results  in  a  reverberation  time  of  430  ms 
to 270 ms calculated using Sabine’s formula. Low frequency noise
was added to the signal at an SNR of 40 dB.
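
The quoted reverberation times can be checked against Sabine's formula, T60 = 0.161 V / A, where V is the room volume and A is the total absorption (surface area times absorption coefficient, summed over surfaces). The sketch below uses the room dimensions and coefficients stated above; the function name `sabine_rt60` is hypothetical.

```python
def sabine_rt60(dims, alpha_walls, alpha_floor, alpha_ceiling):
    """Sabine reverberation time in seconds for a rectangular room (metres)."""
    lx, ly, lz = dims
    volume = lx * ly * lz
    absorption = (lx * ly * (alpha_floor + alpha_ceiling)   # floor and ceiling
                  + 2.0 * lz * (lx + ly) * alpha_walls)     # four walls
    return 0.161 * volume / absorption

rt_reflective = sabine_rt60((4.0, 5.0, 3.0), 0.05, 0.7, 0.3)
rt_absorptive = sabine_rt60((4.0, 5.0, 3.0), 0.30, 0.7, 0.3)
```

Rounded to two decimals these give 0.43 s and 0.27 s, in agreement with the 430 ms and 270 ms stated above.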

The resulting estimation errors for the various algorithms can
be seen in Fig. 3. The hybrid algorithm used $N_p = 3$ time
delays. It can be seen that the hybrid algorithm has approximately
the same error as SRP-PHAT, while both methods vastly outperform
SI.

6.  EXPERIMENTS  IN  AN  ACTUAL  ROOM 

The hybrid algorithm was also tested in an acoustically untreated
room. The test data included human speakers standing in marked
locations, so that their locations could be easily determined.
Their voices were recorded by eight microphones spread out on one
wall and the ceiling. The region of interest was defined as a box
with a volume of 9 m$^3$. For this experiment, SRP-PHAT required
only 5000 non-linearly spaced points.

Fig. 3. Results of single source simulation for several absorption
values, $\alpha$.

Method            MSE of Location (m$^2$)
SI                3.3858
SRP-PHAT          0.3346
Hybrid, Np = 2    0.3709
Hybrid, Np = 3    0.3125
Hybrid, Np = 4    0.3020
Hybrid, Np = 5    0.4004
Hybrid, Np = 6    0.3780

Fig. 4. Table of real room results using SI, SRP-PHAT, and the
new Hybrid algorithm for several values of Np.

As can be seen in Fig. 4, even with $N_p = 2$, the difference
between the error of Hybrid Localization and that of the
computationally more expensive SRP-PHAT is statistically
insignificant. Hybrid Localization vastly outperforms the SI
algorithm. So Hybrid Localization retains the robustness of
SRP-PHAT.

One concern for real-time localization is the speed of
computation. By counting the number of operations required to
estimate locations, it can be shown that Hybrid Localization is
much faster than SRP-PHAT. To increase the speed of SRP-PHAT, a
table look-up method is used to find the delays, which are
calculated beforehand. Fig. 5 shows the number of points that can
be in the region of interest for SRP-PHAT to have a computational
cost equivalent to that of Hybrid Localization. In this case M
represents the number of channel pairs. With eight microphones,
M = 28; so Hybrid Localization, with $N_p = 2$, is equivalent in
cost to searching 290 points. The room experiment required 5000
points for the look-up table, so Hybrid Localization requires less
than a tenth of the computation cost.
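
The arithmetic behind this comparison is simple enough to spell out. The sketch below only restates the figures quoted above (eight microphones, 290 equivalent points, 5000 SRP-PHAT points); the variable names are illustrative.

```python
from math import comb

n_mics = 8
n_pairs = comb(n_mics, 2)          # number of distinct microphone pairs
equivalent_points = 290            # Hybrid Localization cost, N_p = 2
srp_points = 5000                  # points searched in the room experiment
cost_ratio = equivalent_points / srp_points
```

Here `n_pairs` evaluates to 28 and `cost_ratio` to 0.058, that is, roughly 6% of the cost of the full SRP-PHAT search.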

Fig. 5. This table shows how many points can be sampled for
SRP-PHAT to be equivalent to Hybrid Localization for a given number
of channel pairs and number of peaks. If more points are required,
then Hybrid Localization is more efficient.

7.  CONCLUSION

Hybrid Localization is a good compromise between robustness and
ease of computation. It uses SX to create a set of candidate
locations to be used in SRP-PHAT. This algorithm combines the best
aspects of SX and SRP-PHAT. It greatly reduces the computation
cost of SRP-PHAT while retaining its robustness. The new hybrid
algorithm is an effective solution for robust real-time source
localization.